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Abstract 



Interest in niultioutput kernel methods is increasing, whether under the guise of multitask 
learning, multisensor networks or structured output data. From the Gaussian process 
perspective a multioutput Mercer kernel is a covariance function over correlated output 
functions. One way of constructing such kernels is b ased on convolution processes (CP). A 



Alvarez and Lawrence 



(20091 recently 



key problem for this approach is efficient inference, 
presented a sparse approximation for CPs that enabled efficient inference. In this paper, 
we extend this work in two directions: we introduce the concept of variational inducing 
functions to handle potential non-smooth functions involved in the kernel CP construction 
and we consider an alternative approach to approximate inference based on variational 



methods, extending the work by Titsias (2009 1 to the multiple output case. We demonstrate 
our approaches on prediction of school marks, compiler performance and financial time 
series. 



1. Introduction 

Gaussian processes (GPs) are flexible non-parametric models wliich allow us to specify prior 
distributions and perform inference of functions. A limiting characteristic of GPs is the fact 
that the computational cost of inference is in general 0(A^'^), being the number of data 
points, with an associated storage requirement of 0{N'^). In recent years a lot of progress 
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1 has been made with approximations that 



allow inference in 0{K'^N) (and associated storage of 0{KN), where X is a user specified 
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number. This has made GPs practical for a range of larger scale inference problems. 
In this paper we are specifically interested in developing priors over multiple functions. 
While such priors can be trivially specified by considering the functions to be independent, 
our focus is on priors which specify correlations between the functions. Most attempts to 



apply such priors so far ( Teh et al. 2005 Osborne et al. 2008 Bonilla et al. 2008 1 have 



focussed on what is known in the geostatistics community as "linear model of coregionaliza- 



tion" (LMC) ( Journel and Huijbregts 1978 Goovaerts 1997 1 . In these models the different 
outputs are assumed to be linear combinations of a set of one or more "latent functions" so 
that the dth output of the function, (x) is given by 



Q 



fd (x) = ^ ad,qUq (x) 



(1) 



9=1 



where Ug (x) is one of Q latent functions that, weighted by {cLd,q}^^i^ sum to form each of the 
D outputs. GP priors are placed, independently, over each of the latent functions inducing a 



correlated covariance function over {fd{'^)}d=i- Approaches to multi-task learning arising 



in the kernel community (see for example Evgeniou et al. , 2005 ) can also be seen to be 
instances of the LMC framework. 

We wish to go beyond the LMC framework, in particular, our focus is convolution pro- 
cesses (jHigdon! [2002 Boyle and Prean 2005 ) . Using CPs for multi-output GPs was pro- 



posed by |Higdon| (j2002| and introduced to the machine learning audience by Boyle and 
Frean (20051. Convolution processes allow the integration of prior information fro m phys- 
ical models, such as ordinary differential equations, into the covariance function. Alvarez 



et al. 


(2009 


), inspired by 


Gao et al. 


(2008 



differential equations, as well as partial differential equations, can be accommodated in a 
covariance function. Their interpretation is that the set of latent functions are a set of latent 
forces, and they term the resulting models "latent force models" . The covariance functions 
for these models are derived through convolution processes (CPs). In the CP framework, 
output functions are generated by convolving Q latent processes {ng(x)}^^^ with kernel 
functions Gd^g(x), associated to each output d and latent force q, so that we have 



/j(x) 



9=1 



Gd,q (X - Z) Uq (z) dz. 



(2) 



The LMC can be seen as a particular case of the CP, in which the kernel functions Gd,q{^) 
correspond to scaled Dirac (5-function Gd,q{'^ — z) = ad^q5{'x. — z). In latent force models 
the convolving kernel, Gd,r{')i is the Green's function associated to a particular differential 
equation. 

A practical problem associated with the CP framework is that in these models inference 
has computational coinplexit y O(A^'^D^) and storage requirements 0{N'^D'^). Recently, 
Alvarez and Lawrence (2009) introduced an efficient approximation for inference in this 



multi-output GP model. The idea was to exploit a conditional independence assumption 
over the output functions {fd{'^)}^=i given a finite number of observations of the latent 



1. Not kernels in the Mercer sense, but kernels in the normal sense. 
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{uq (xfc)}^^^ 1 . This led to approximations that were very similar in spirit to 
the PITC and FI TC ap proximations of Snelson and Ghahramani ] 2006|); Quihonero Candela 



and Rasmussen 



( 2005 1 . In this paper we build on the work of 



Alvarez and Lawrence 



Their 



approximation was inspired by the fact that if the latent functions are observed in their 
entirety, the output functions are conditionally indep endent of one another (as caii be seen 
in Q). We extend the previous work presented in Alvarez and Lawrence (2009) in two 
ways. First, a problem with the FITC and PITC approximations can be their propensity to 
overfit when inducing inputs are optimized. A solution to this problem was given in recent 
work by Titsias (2009) who provides a sparse GP approximation that has an associated 
variational bound. In this paper we show how the ideas of Titsias can be extended to 
the multiple output case. Second, we notice that if the locations of the inducing points, 
{xfc}^-)^, are close relative to the length scale of the latent function, the PITC approximation 
will be accurate. However, if the length scale becomes small the approximation requires 
very many inducing points. In the worst case, the latent process could be white noise (as 
suggested by Higdon ( 2002 ) and implemented by Boyle and Frean ( 2005 1 ) . In this case the 
approximation will fail completely. We further develop the variational approximation to 
allow us to work with rapidly fluctuating latent functions (including white noise). This is 
achieved by augmenting the output functions with one or more additional functions. We 
refer to these additional outputs as the inducing functions. Our variational approximation 
is developed through the inducing functions. There are also smoothing kernels associated 
with the inducing functions. The quality of the variational approximation can be controlled 
both through these inducing kernels and through the number and location of the inducing 
inputs. 

Our approximation allows us to consider latent force models with a larger number of states, 
D, and data points A^. The use of inducing kernels also allows us to extend the induc- 
ing variable approximation of the latent force model framework to systems of stochastic 
differential equations (SDEs). In this paper we apply the variational inducing kernel ap- 
proximation to different real world datasets, including a multivariate financial time series 
example. 

A similar idea to the inducing function one introduced in this paper, was simultaneously 



proposed by Lazaro-Gredilla and Figueiras-Vidal (2010). Lazaro-Gredilla and Figueiras 



Vidal (2010) introduced the concept of inducing feature to improve performance over the 
pseudo-inputs approach of Snelson and Ghahramani (20061 in sparse GP models. Our 
use of inducing functions and inducing kernels is motivated by the necessity to deal with 
non-smooth latent functions in the convolution processes model of multiple outputs. 



2. Multiple Outputs Gaussian Processes 

Let Yd £ M.^ , where d = 1, . . . , D,he the observed data associated with the output function 
yrf(x). For simplicity, we assume that all the observations associated with different outputs 
are evaluated at the same inputs X (although this assumption is easily relaxed). We will 
often use the stacked vector y = (yi, . . . , yz?) to collectively denote the data of all the out- 
puts. Each observed vector is assumed to be obtained by adding independent Gaussian 
noise to a vector of function values so that the likelihood is p{yd\fd) = -^iydl^djf^d^), 



3 



where is defined via ([2]). More precisely, the assumption in Q is that a function value 
/rf(x) (the noise- free version of yd(x)) is generated from a common pool of Q independent 
latent functions {Mg(x)}^j^, each having a covariance function (Mercer kernel) given by 
kq (x,x'). Notice that the outputs share the same latent functions, but they also have their 
own set of parameters ({ct^^g}^]^, cr^) where cc^^g are the parameters of the smoothing ker- 
nel Gd,q{-)- Because convolution is a linear operation, the covariance between any pair of 
function values /(^(x) and fd'{^') is given by 




This covariance function is used to define a fully-coupled GP prior p(fi, . . . , fo) over all the 
function values associated with the different outputs. 

The joint probability distribution of the multioutput GP model can be written as 

D 

Piiya, mti) = n PiydlWfi, . . . , fD). 

The GP prior p(fi, . . . , fo) has a zero mean vector and a {ND) x (ND) covariance matrix 
Kf^f, where f = (fi, . . . , f/5), which consists oi N x N blocks of the form Kf^^f^, . Elements 
of each block are given by fcj^ j^, (x, x') for all possible values of x. Each of such blocks is 
either a cross-covariance or covariance matrix of pairs of outputs. 

Prediction using the above GP model, as well as the maximization of the marginal likelihood 
p{y) = A^(y|0,Kf^f -I- Xl), where XI = diag((jfl, . . . , al,!), requires 0{N^D^) time and 
0{N'^D'^) storage which rapidly becomes infeasible even when only few hundreds of outputs 
and data are considered. Therefore approximate or sparse methods are needed in order to 
make the above multioutput GP model practical. 



3. PITC-like approximation for Multiple Outputs Gaussian Processes 

Before we propose our variational sparse inference me thod for multioutput GP regr ession 



Alvarez and Lawrence 



(2009). This 



in Section |4j we review the sparse method proposed by 
method is based on a likelihood approximation. More precisely, each output function yrf(x) 
is independent from the other output functions given the full-length of each latent function 
Uq(x.). This means, that the likelihood of the data factorizes according to 



D 



D 



p{y\u) = Y{p{yd\u) = n^'^^'^l^'^)' 



d=l 



d=l 



with U = 



the set of latent functions. The sparse method in 



Alvarez and Lawrence 



(20091 makes use of this factorization by assuming that it remains valid even when we 



are only allowed to exploit the information provided by a finite set of function values, 
Uq, instead of the full-length function Uq(x) (which involves uncountably many points). 
Let Uq, for q = 1,...,Q, be a i^T-dimensional vector of values from the function Uq(x.) 
which are evaluated at the inputs Z = {z^}^^. These points are commonly referred to 
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as inducing inputs. The vector u = (ui,...,uq) denotes ah these variables. The sparse 
method approximates the exact likelihood function p(y|ti) with the likelihood 



D D 



p(y|u) = fjp(yd|u) = JjAA(yd|/2f^|u,s;f^|u + cj^/), 

d=l d=l 

where fif^\^ = Kf^^uK-;i,u and Ef^i^ = Kf^^f^ - Kf^^uK^j^Ku^f^ are the mean and covari- 
ance matrices of the conditional GP priors p(frf|u). The matrix Ku,u is a block diagonal 
covariance matrix where the gth block Ku^^u, is obtained by evaluating kq{z,z') at the 
inducing inputs Z. Further, the matrix Kf^^u has entries defined by the cross-covariance 
function 

Cov[/d(x),Ug(z)] = j^Gd,q{:>c - z')kg{z' ,z)dz . 

The variables u follow the GP prior p(u) = A^(u|0, Ku,u) and can be integrated out to give 
the following approximation to the exact marginal likelihood: 

p{y\e) = AA(y|0, D + Kf,uK„i,Ku,f + S). (3) 

Here, D is a block-diagonal matrix, where each block in the diagonal is given by Kf^ — 
Kf^^uK" ^Ku.f^ for all d. This approximate marginal likelihood represents exactly each 
diagonal (output-specific) block Kf^ while each off diagonal (cross-output) block Kf^ f^, 
is approximated by the Nystrom matrix Kf^^uK^ jjKu^f^, . 

The above sparse method has a similar structure to the PITC approximation introduced 
for single-output regression OQuifionero Candela and Rasmussen 20051. Because of this 



similarity, Alvarez and Lawrence^ (2009 1 call their multioutput sparse approximation PITC 
as well. Two of the properties of this PITC approximation, which can be also its limitations, 
are: 

1. It assumes that all latent functions in u are smooth. 

2. It is based on a modification of the initial full GP model. This implies that the 
inducing inputs Z are extra kernel hyparameters in the modified GP model. 

Because of point 1, the method is not applicable when the latent functions are white noise 
processes. An important class of problems where we have to deal with white noise processes 
arise in linear SDEs where the above sparse method is currently not applicable there. Be- 
cause of 2, the maximization of the marginal likelihood in eq. ([3| with respect to (Z,0), 
where 6 are model hyperparameters, may be prone to overfitting especially when the num- 
ber of variables in Z is large. Moreover, fitting a modified sparse GP model implies that 
the full GP model is not approximated in a systematic and rigorous way since there is no 
distance or divergence between the two models that is minimized 

In the next section, we address point 1 above by introducing the concept of variational 
inducing kernels that allow us to efficiently sparsify multioutput GP models having white 
noise latent functions. Further, these inducing kernels are incorporated into the variational 



inference method of Titsias (2009) (thus addressing point 2) that treats the inducing inputs 
Z as well as other quantities associated with the inducing kernels as variational parame- 
ters. The whole variational approach provides us with a very flexible, robust to overfitting, 
approximation framework that overcomes the limitations of the PITC approximation. 
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4. Sparse variational approximation 

In this section, we introduce the concept of variational inducing kernels (VIKs). VIKs give 
us a way to define more general inducing variables that have larger approximation capacity 
than the u inducing variables used earlier and importantly allow us to deal with white noise 
latent functions. To motivate the idea, we first explain why the u variables can work when 
the latent functions are smooth and fail when these functions become white noises. 
In PITC, we assume each latent function Uq{x.) is smooth and we sparsify the GP model 
through introducing, Ug, inducing variables which are direct observations of the latent 
function, 'Ug(x), at particular input points. Because of the latent function's smoothness, the 
Uq variables also carry information about other points in the function through the imposed 
prior over the latent function. So, having observed Uq we can reduce the uncertainty of the 
whole function. 

With the vector of inducing variables u, if chosen to be sufficiently large relative to the 
length scales of the latent functions, we can efficiently represent the functions {itg(x)}^^ 
and subsequently variables f which are just convolved versions of the latent functions]^ 
When the reconstruction of f from u is perfect, the conditional prior p(f |u) becomes a delta 



function and the sparse PITC approximation becomes exact. Figure 1(a) shows a cartoon 
description of a summarization of Ug(x) by Uq. 

In contrast, when some of the latent functions are white noise processes the sparse ap- 
proximation will fail. If Uq{z) is white nois^it has a covariance function 6{z — z'). Such 
processes naturally arise in the application of stochastic differential equations (see section 
[7]) and are the ultimate non-smooth processes where two values Uq{z) and Uq{z') are uncor- 
related when z ^ z'. When we apply the sparse approximation a vector of "white-noise" 
inducing variables Uq does not carry information about Uq{z) at any input z that differs 
from all inducing inputs Z. In other words there is no additional information in the condi- 



tional prior p{uq{z)\uq) over the unconditional prior p{uq{z)). Figure 1(b) shows a pictorial 
representation. The lack of structure makes it impossible to exploit the correlations in the 
standard sparse methods hke PITC0 

Our solution to this problem is the following. We will define a more powerful form of 
inducing variable, one based not around the latent function at a point, but one given by the 
convolution of the latent function with a smoothing kernel. More precisely, let us replace 
each inducing vector Uq with the variables Xq which are evaluated at the inputs Z and are 
defined according to 

A,(z) = J r,(z-v)n,(v)dv, (4) 

where ^^(x) is a smoothing kernel (e.g. Gaussian) which we call the inducing kernel (IK). 
This kernel is not necessarily related to the model's smoothing kernels. These newly defined 
inducing variables can carry information about Uq{z) not only at a single input location but 



2. This idea is like a "soft version" of the Nyquist-Shannon samphng theorem. If the latent functions were 
bandlimited, we could compute exact results given a high enough number of inducing points. In general 
it won't be bandlimited, but for smooth functions low frecuency components will dominate over high 
frecuencies, which will quickly fade away. 

3. Such a process can be thought as the "time derivative" of the Wiener process. 

4. Returning to our sampling theorem analogy, the white noise process has infinite bandwidth. It is therefore 
impossible to represent it by observations at a few fixed inducing points. 
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(a) Latent function is smooth (b) Latent function is noise 



^'zW * T,(x) = A,(x) 




(c) Generation of an inducing function 

Figure 1: With a smooth latent function as in (a), we can use some inducing variables Uq (red dots) 
from the complete latent process Wq(x) (in black) to generate smoothed versions (for example the 
one in blue), with uncertainty described by p{uq\uq). However, with a white noise latent function 
as in (b), choosing inducing variables (red dots) from the latent process (in black) does not give 
us a clue about other points (for example the blue dots). In (c) the inducing function Ag(x) acts as 
a surrogate for a smooth function. Indirectly, it contains information about the inducing points and 
it can be used in the computation of the lower bound. In this context, the symbol * refers to the 
convolution integral. 



from the entire input space. Figure 1(c) shows how the inducing kernel generates the 
artificial construction Ag(x), that shares some ligth over the, otherwise, obscure inducing 



points. We can even allow a separate IK for each inducing point, this is, if the set of 

ifc=i' 



inducing points is Z = {zk}^^i, then 



= j Tg^ki^k - v)ng(v)dv, 



with the advantage of associating to each inducing point its own set of adaptive pa- 
rameters in Tq^k- For the PITC approximation, this adds more hyperparameters to the 
likelihood, perhaps leading to overfitting. However, in the variational approximation we 
define all these new parameters as variational parameters and therefore they do not cause 
the model to overfit. We use the notation A to refer to the set of inducing functions {Ag}^^. 



7 



If Uq{z) has a white noise [^GP prior the covariance function for Ag(x) is 

Cov[A,(x), A,(x')] = J T,(x - z)r,(x' - z)dz 
and the cross-covariance function between /d(x) and Ag(x') is 

Cov[/d(x), A,(x')] = J Gd,g(x - z)T,(x' - z)dz. 



(5) 



(6) 



Notice that this cross-covariance function, unhke the case of u inducing variables, maintains 
a weighted integration over the whole input space. This implies that a single inducing 
variable Ag(x) can properly propagate information from the full-length process Uq{x.) into 
the set of outputs f . 

It is possible to combine the IKs defined above with the PITC approximation of 



Alvarez 



and Lawrence (2009), but in this paper our focus will be on applying them within the 



variational framework of Titsias (20091. We therefore refer to the kernels as variational 
inducing kernels (VIKs). 



5. Variational inference for sparse multiple output Gaussian Processes. 



We now extend the variational inference method of Titsias (2009) to deal with multiple 
outputs and incorporate them into the VIK framework. 

We compactly write the joint probability model p{{'yd,td}d=i) ^ p(y)f) = p(y|f)p(f)- 
The first step of the variational method is to augment this model with inducing variables. 
For our purpose, suitable inducing variables are defined through VIKs. More precisely, let 
A = (Ai, . . . , Aq) be the whole vector of inducing variables where each Ag is a i^-dimensional 
vector of values obtained according to eq. (Q . The role of \q is to carry information about 
the latent function Uq{z). Each Xq is evaluated at the inputs Z and has its own VIK, Tq(x.), 
that depends on parameters ^t,- We denote these parameters as = {^t,}^ 
The A variables augment the GP model according to 



p{y,{,\)=p{y\f)p{{\\)p{\). 

Here, p(A) = AA(A|0, Ka,a) and Ka,a is a block diagonal matrix where each block K^^.Ag in 
the diagonal is obtained by evaluating the covariance function in eq. ([s]) at the inputs Z. Ad- 
ditionally, p(f|A) = AA(f |Kf^AK^"|^A, Kf^f — Kf^^K^^'^K^.f) where the cross-covariance Kf^A 



is computed through eq. ([6j). Because of the consistency condition J p{f\\)p{X)dX = p(f), 
performing exact inference in the above augmented model is equivalent to performing exact 
inference in the initial GP model. Crucially, this holds for any values of the augmentation 
parameters (Z,0). This is the key property that allows us to turn these augmentation 
parameters into variational parameters by applying approximate sparse inference. 
Our method now follows exactly the lines of Titsias ( 2009 ) (in appendix |A] we present a 
detailed derivation of the bound based on the set of latent functions 'Ug(x)). We introduce 
the variational distribution g(f. A) = p(f | A)(/)(A), where p(f|A) is the conditional GP prior 



5. It is straightforward to generalize the method for rough latent functions that are not white noise or to 
combine smooth latent functions with white noise. 
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defined earlier and (/'(A) is an arbitrary variational distribution. By minimizing the KL 
divergence between g(f , A) and the true posterior p(f, A|y), we can compute the following 
Jensen's lower bound on the true log marginal likelihood: 



Tv{T., 0) = log (y |0, , + _ 1 tr ( 



where X) is the covariance function associated with the additive noise process and K = 
Kf^f — Kf^;^K^"^^K;^^f . Note that this bound consists of two parts. The first part is the 
log of a GP prior with the only difference that now the covariance matrix has a particular 
low rank form. This form allows the inversion of the covariance matrix to take place in 
0{NDK^) time rather than ©(iV^D^). The second part can be seen as a penalization term 
that regulates the estimation of the parameters. Notice also that only the diagonal of the 
exact covariance matrix Kf needs to be computed. Overall, the computation of the bound 
can be done efficiently in 0{N DK"^) time. 

The bound can be maximized with respect to all parameters of the covariance function; 
both model parameters and variational parameters. The variational parameters are the 
inducing inputs Z and the parameters Ot^ of each VIK which are rigorously selected so that 
the KL divergence is minimized. In fact each VIK is also a variational quantity and one 
could try different forms of VIKs in order to choose the one that gives the best lower bound. 
The form of the bound is very similar to the projected process approximation, also known as 



Deterministic Training Conditional approximation (DTC) (Csato and Opper, 2001 Seeger 



et al. 2003 Rasmussen and Williams 20061. However, the bound has an additional trace 
term that penalizes the movement of inducing inputs away from the data. This term 
converts the DTC approximation to a lower bound and prevents overfitting. In what follows, 
we refer to this approximation as DTCVAR, where the VAR suffix refers to the variational 
framework. 

The predictive distribution of a vector of test points, given the training data can also be 
found to be 



p(y*|y,X,Z) = J\f {y^.\^ly,,T,y^) , 



K^f^ + 1]* and A 



with /ly. = Kf,AA-^KAfS;"V and Sy. = Kf,f, -Kf,A (K 

Kaa + KAf^-^Kf A. Predictive means can be computed in 0{NDK) whereas predictive 
variances require 0{NDK'^) computation. 



6. Experiments 

We present results of applying the method proposed for two real-world datasets that will 
be described in short. We compare the results obtained using PITC, the intrinsic coregion- 
alization model (ICM)|^ employed in (Bonilla et al. 20081 and the method using variational 
inducing kernels. For PITC we estimate the parameters through the maximization of the 
approximated marginal likelihood of equation ([3| using a scaled-conjugate gradient method. 

6. The intrinsic coregionalization model is a particular case of the linear model of coregionalization with 
one latent function (Goovaerts 19971. See equation ([l]) with Q = 1. 
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We use one latent function and both the covariance function of the latent process, kq{x, x'), 
and the kernel smoothing function, Grf^q(x), follow a Gaussian form, this is 

A:(x,x') =AA(x-x'|0,C), 

where C is a diagonal matrix. For the DTCVAR approximations, we maximize the vari- 
ational bound Optimization is also performed using scaled conjugate gradient. We 
use one white noise latent function and a corresponding inducing function. The inducing 
kernels and the model kernels follow the same Gaussian form as in the PITC case. Using 
this form for the covariance or kernel, all convolution integrals are solved analytically. 



6.1 Exam score prediction 

In this experiment the goal is to predict the exam score obtained by a particular student 
belonging to a particular school. The data comes from the Inner London Education Author- 
ity (ILEA) 1^ It consists of examination records from 139 secondary schools in years 1985, 
1986 and 1987. It is a random 50% sample with 15362 students. The input space consists 
of features related to each student and features related to each school. From the multiple 
output point of view, each school represents one output and the exam score of each student 
a particular instantiation of that output. 



We follow the same preprocessing steps employed in (Bonilla et al. , 2008 1 . The only features 
used are the student-dependent ones (year in which each student took the exam, gender, 
VR band and ethnic group), which are categorial variables. Each of them is transformed to 
a binary representation. For example, the possible values that the variable year of the exam 
can take are 1985, 1986 or 1987 and are represented as 100, 010 or 001. The transformation is 
also applied to the variables gender (two binary variables), VR band (four binary variables) 
and ethnic group (eleven binary variables), ending up with an input space with dimension 
20. The categorial nature of data restricts the input space to 202 unique input feature 
vectors. However, two students represented by the same input vector x and belonging both 
to the same school d, can obtain different exam scores. To reduce this noise in the data. 



we follow Bonilla et al. (2008) in taking the mean of the observations that, within a school, 
share the same input vector and use a simple heteroskedastic noise model in which the 
variance for each of these means is divided by the number of observations used to compute 
it. The performance measure employed is the percentage of explained variance defined as 
the total variance of the data minus the sum-squared error on the test set as a percentage 
of the total data variance. It can be seen as the percentage version of the coefficient of 
determination between the test targets and the predictions. The performance measure is 
computed for ten repetitions with 75% of the data in the training set and 25% of the data 
in the test set. 



Figure 6.1 shows results using PITC, DTCVAR with one smoothing kernel and DTCVAR 
with as many inducing kernels as inducing points (DTCVARS in the figure). For 50 inducing 
points all three alternatives lead to approximately the same results. PITC keeps a relatively 
constant performance for all values of inducing points, while the DTCVAR approximations 
increase their performance as the number of inducing points increase. This is consistent with 



7. Data is available at http://www.ciimi.bristol.ac.uk/learning-training/multilevel-m-support/ 
.datasets . shtml. 
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the expected behaviour of the DTCVAR methods, since the trace term penahzes the model 
for a reduced number of inducing points. Notice that all the approximations outperform 
independent GPs and the best result of the intrinsic coregionalization model presented in 



(Bonilla et al. 2008 1 . 
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Figure 2: Exam score prediction results for the school dataset. Results include the mean of the 
percentage of explained variance of ten repetitions of the experiment, together with one standard 
deviation. In the bottom, SM X stands for sparse method with X inducing points, DTCVAR refers 
to the DTC variational approximation with one smoothing kernel and DTCVARS to the same 
approximation using as many inducing kernels as inducing points. Results using the ICM model 
and independent GPs (appearing as IND in the figure) have also been included. 



6.2 Compiler prediction performance. 

In this dataset the outputs correspond to the speed-up of 11 C programs after some trans- 
formation sequence has been applied to them. The speed-up is defined as the execution 
time of the original program divided by the execution time of the transformed program. 
The input space consists of 13-dimensional binary feature vectors, where the presence of a 
one in these vectors indicates that the program has received that particular transformation. 
The dataset contains 88214 observations for each output and the same number of input 
vectors. All the outputs share the same input space. Due to technical requirements, it is 
important that the prediction of the speed-up for the particular program is made using few 



observations in the training set. We compare our results to the ones presented in (Bonilla 



et al. 2008) and use N = 16, 32, 64 and 128 for the training set. The remaining 88214 - N 



observations are used for testing, employing as performance measure the mean absolute 
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error. The experiment is repeated ten times and standard deviations are also reported. We 
only include results for the average performance over the 11 outputs. 

Figure [s] shows the results of applying independent GPs (IND in the figure) , the intrinsic 
coregionalization model (ICM in the figure), PITC, DTCVAR with one inducing kernel 
(DTCVAR in the figure) and DTCVAR with as many inducing kernels as inducing points 
(DTCVARS in the figure). Since the training sets are small enough, we also include results of 
applying the GP generated using the full covariance matrix of the convolution construction 
(see FULL GP in the figure). We repeated the experiment for different values of K, but 



show results only for K = N/2. Results for ICM and IND were obtained from ( jBonilla et al. 



20081. In general, the DTCVAR variants outperform the ICM method, and the independent 
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Figure 3: Mean absolute error and standard deviation over ten repetitions of the compiler experiment 
as a function of the training points. IND stands for independent GPs, ICM stands for intrinsic 
coregionalization model, DTCVAR refers to the DTCVAR approximation using one inducing kernel, 
DTCVARS refers to the DTCVAR approximation using as many inducing kernels as inducing points 
and FULL GP stands for the GP for the multiple outputs without any approximation. 



GPs for N = 16, 32 and 64. In this case, using as many inducing kernels as inducing 
points improves in average the performance. All methods, including the independent GPs 
are better than PITC. The size of the test set encourages the application of the sparse 
methods: for = 128, making the prediction of the whole dataset using the full GP takes 
in average 22 minutes while the prediction with DTCVAR takes 0.65 minutes. Using more 
inducing kernels improves the performance, but also makes the evaluation of the test set 
more expensive. For DTCVARS, the evaluation takes in average 6.8 minutes. Time results 
are average results over the ten repetitions. 
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7. Stochastic Latent Force Models for Financial Data 



The starting point of stochastic differential equations is a stochastic version of the equation 
of motion, which is called Langevin's equation: 

^ = -Cf{t) + Su{t), (7) 

where f{t) is the velocity of the particle, —Cf{t) is a systematic friction term, u{t) is a 
random fluctuation external force, i.e. white noise, and S indicates the sensitivity of the 
ouput to the random fluctuations. In the mathematical probability literature, the above is 
written more rigorously as d/(t) = —Cf{t)dt + S(iW{t) where W{t) is the Wiener process 
(standard Brownian motion). Since u{t) is a Gaussian process and the equation is linear, 
f{t) must be also a Gaussian process which turns out to be the Ornstein-Uhlenbeck (OU) 
process. 

Here, we are interested in extending the Langevin equation to model multivariate time 
series. The way that the model in ([7| is extended is by adding more output signals and 
more external forces. The forces can be either smooth (systematic or drift-type) forces or 
white noise forces. Thus, we obtain 

'^ = -D,m+j2Sd,Mt)^ (8) 

where /d(t) is the dth output signal. Each Uq{t) can be either a smooth latent force that is 
assigned a GP prior with covariance function kq{t, t') or a white noise force that has a GP 
prior with covariance function 5{t — t'). That is, we have a composition of Q latent forces, 
where Qs of them correspond to smooth latent forces and Qo correspond to white noise 
processes. The intuition behind this combination of input forces is that the smooth part 
can be used to represent medium/long term trends that cause a departure from the mean 
of the output processes, whereas the stochastic part is related to short term fluctuations 
around the mean. A model that employs Qs = 1 and Qo = was proposed by [Lawrence 



et al. (2007) to describe protein transcription regulation in a single input motif (SIM) gene 
network. 

Solving the differential equation ([s]), we obtain 

Q ft 

fd{t) = e-^'^Vd.o + V Sd,g / e-^^(*-^)n,(z)(iz, 



where fdfi arises from the initial condition. This model now is a special case of the multiout- 
put regression model discussed in sections 1 and 2 where each output signal yd{t) = fd{t) + e 
has a mean function e~^'^^fdfl and each model kernel Gd,g(x) is equal to Sd.qe-~^'^^^~^\ The 
above model can be a l so vie wed as a stochastic latent force model (SLFM) following the 



work of Alvarez et al. ( 2009 1 . 



Latent market forces 

The application considered is the inference of missing data in a multivariate financial data 
set: the foreign exchange rate w.r.t. the dollar of 10 of the top international currencies 
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(c) AUD: Real data and prediction 

Figure 4: Predictions from the model with Qs — I and Qo ~ 3 are shown as sohd hnes for the 
mean and grey bars for error bars at 2 standard deviations. For CAD, JPY and AUD the data was 
artificially held out. The true values are shown as a dotted line. Crosses on the x-axes of all plots 
show the locations of the inducing inputs. 



(Canadian Dollar [CAD], Euro [EUR], Japanese Yen [JPY], Great British Pound [GBP], 
Swiss Franc [CHF], Australian Dollar [AUD], Hong Kong Dollar [HKD], New Zealand Dollar 
[NZD], South Korean Won [KRW] and Mexican Peso [MXN]) and 3 precious metals (gold 
[XAU], silver [XAG] and platinum [XPT])|^We considered all the data available for the 
calendar year of 2007 (251 working days). In this data there are several missing values: 
XAU, XAG and XPT have 9, 8 and 42 days of missing values respectively. On top of this, 
we also introduced artificially long sequences of missing data. Our objective is to model the 
data and test the effectiveness of the model by imputing these missing points. We removed 
a test set from the data by extracting contiguous sections from 3 currencies associated with 
very different geographic locations: we took days 50-100 from CAD, days 100-150 from 



8. Data is available at http : //f x . sauder . ubc . ca/data . html I 
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JPY and days 150-200 from AUD. The remainder of the data comprised the training set, 
which consisted of 3051 points, with the test data containing 153 points. For preprocessing 
we removed the mean from each output and scaled them so that they all had unit variance. 
It seems reasonable to suggest that the fluctuations of the 13 correlated financial time-series 
are driven by a smaller number of latent market forces. We therefore modelled the data 
with up to six latent forces which could be noise or smooth GPs. The GP priors for the 
smooth latent forces are assumed to have a squared exponential covariance function, 

kqit, t') = —L= exp ( - ) , 

where the hyperparameter £q is known as the lengthscale. 

We present an example with Q = 4. For this value of Q, we consider all the possible combi- 
nations of Qo and Qg . The training was performed in all cases by maximizing the variational 
bound using the scale conjugate gradient algorithm until convergence was achieved. The 
best performance in terms of achiving the highest value for J^v was obtained for Qg = 1 and 
Qo = 3. We compared against the LMC model for different values of the latent functions 
in that framework. While our best model resulted in an standardized mean square error 
of 0.2795, the best LMC (with Q=2) resulted in 0.3927. We plotted predictions from the 
latent market force model to characterize the performance when filling in missing data. 
In figure |4] we show the output signals obtained using the model with the highest bound 
(Qs = 1 and Qo = 3) for CAD, JPY and AUD. Note that the model performs better at 
capturing the deep drop in AUD than it does at capturing fluctuations in CAD and JPY. 



8. Conclusions 

We have presented a variational approach to sparse approximations in convolution pro- 
cesses. Our main focus was to provide efficient mechanisms for learning in multiple output 
Gaussian processes when the latent function is fluctuating rapidly. In order to do so, we 
have introduced the concept of inducing function, which generalizes the idea of inducing 
point, traditionally employed in sparse GP methods. The approach extends the variational 



approximation of Titsias (2009 1 to the multiple output case. Using our approach we can per- 
form efficient inference on latent force models which are based around stochastic differential 
equations, but also contain a smooth driving force. Our approximation is flexible enough 
and has been shown to be applicable to a wide range of data sets, including high-dimensional 
ones. 
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Appendix A. Variational Inducing Kernels 

Recently, a method for variational sparse approximation for Gaussian processes learning 



was introduced in Titsias (20091. In this appendix, we apply this methodology to a mul- 
tiple output Gaussian process where the outputs have been generated through a so called 
convolution process. For learning the parameters of the kernels involved, a lower bound for 
the true marginal can be maximized. This lower bound has similar form to the marginal 
likelihood of the Deterministic Training Conditional (DTC) approximation plus an extra 
term which involves a trace operation. The computational complexity grows as 0{NDK^) 
where is the number of data points per output, D is the number of outputs and K the 
number of inducing variables. 

A.l Computation of the lower bound 

Given target data y and inputs X, the marginal likelihood of the original model is given by 
integrating over the latent functioi]|^ 



P(y|X) = / p{y\u,X)p{u)du. 

J u 

The prior over u is expressed as 

p{u) = J p{u\X)p{\)dX. 

The augmented joint model can then be expressed as 

p{y,u,\) = p{y\u)p{u\X)p{X). 
With the inducing function A, the marginal likelihood takes the form 



p(y|X) = / p{y\u,X)p{u\X)p{X)dXdu. 

J u,X 

Using Jensen's inequality, we use the following variational bound on the log likelihood, 

T fry ,f^,^^ f ( P{y\u,^)p{u\X)p{X) 

J^V'(Z,0,0(A = / g(M,A log — dAdn, 

Ju,x q{u,X) 

where we have introduced q{u, A) as the variational approximation to the posterior. Fol- 



lowing Titsias ( 2009 1 we now specify that the variational approximation should be of the 
form 

q{u,X)=p{u\X)cj){X). 

9. Strictly speaking, the distributions associated to u correspond to random signed measures, in particular, 
Gaussian measures. 
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We can write our bound as 



J-v(Z, 0, 0(A)) = Jj{X) jj{u\X) |logp(y|ix) + log ^1 dndA. 
To compute this bound we first consider the integral 

logT(A,y) = / p{u\X)\ogp{y\u)du. 

J u 

Since this is simply the expectation of a Gaussian under a Gaussian we can compute the 
result analytically as follows 

logT(A,y) =Y.Jj>{u\\) |-^log27r- ^log|5]| - ^ tr [^-^ (y,yj - 2y,fJ + f.f/)] | d«. 
We need to compute E„|;^ [f^] and E„|;^ [frffj] • ^u\x [fd] is a vector with elements 

Q r. 

E„|A[/d(Xn)] =^ / Grf,,(x„ - z')E„|;,[^X5(z')]dz'. 
9=1 •'^ 

Assuming that the latent functions are independent GPs, E„|;^[Mg(z')] = Ey^|;y^^[?Xq(z')] = 
A;„,;^(z',Z)K3;i;,^(Z,Z)A,. Then 

Q 

E„|A [/d(Xn)] = ^/dA,(Xn, Z)K^;;,^(Z, Z)A,. 
9=1 

IEw|A [fd] can be expressed as 

E„|;, [fd] = Kf,AK-i A = ad(X, A) = a^. 
On the other hand, ^u\x [frff/] is a matrix with elements 

lEn|A[/d(Xn)/d(Xm)] / Grf^^lXn - z) / Gd_g(x^ - z')E„|;,[Ug(z)M5(z')]dzdz' + arf(x„)Q;d(x^) . 

q=l •'2 •'2 

With independent GPs the term E„|;^[u5(z)'Uq(z')] can be expressed as 

E„|;,[u,(z)«,(z')] = K,u,{2; z') - A:„^A,(z, Z)K-\^(Z, 'L)kl^^{-L', Z). 
In this way 

with kdd = Kf^f^ - Kf^x^^x^Xfa- 
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The expression for logT(A,y) is given as 



logT(A,y) = logAr(y|a,5]) - ^J^^r (s-^K^^) . 

d=l 

The variational lower bound is now given as 



(9) 



A free form optimization over ^(A) could now be performed, but it is far simpler to reverse 
Jensen's inequality on the first term, we then recover the value of the lower bound for opti- 
mized 0(A) without ever having to explicitly optimise (f){X). Reversing Jensen's inequality, 
we have 

J^y (Z, 0) = log AT (y |0, Kf xK-J^KAf + S) - - J] tr (^-^Km) • 

d=l 

The form of 0(A) which leads to this bound can be found as 

0(A)ocAr(y|a,S)p(A) 

= M (A|S;,|yK-j,KAf S- V, ^x\y) = AT (K;,AA-^KAf S- V, KxxA-'Kxx) , 

with = {Kll + KllKM-^-^KfxKll,)'' = KaaA-^Kaa and A = KAA+K^fl^-^KfA. 

A. 2 Predictive distribution 

The predictive distribution for a new test point given the training data is also required. 
This can be expressed as 

p(y*|y,X,Z) = / p{y>,\u)q{u,X)dXdu = p{y^\u)p{u\X)4>{X)dXdu 

J u,\ Ju,\ 

= / ^'(y*!'") / p(w|A)0(A)dA du. 

Ju \.J \ 

Using the Gaussian form for the 0(A) we can compute 

/ p{u\X)4>{X)dX = / Af{u\ku\K^l^X,kuu - kuxK^xkxu) 
Jx Jx 

X Af (KAAA^KAfJ]- V, KaaA^^Kaa) dA 

= M (u|A;„AA-^KAf V, Ku - Kx - A-^) kxu) ■ 

Which allows us to write the predictive distribution as 

p(y*|y,X,Z)= / AA(y*|f*,5]*)Ar(tx|At„|A,5]„|;,)du = AA(y*|/Xy,,I]yJ 

J u 

with = Kf,AA-iKAf5]-V and = Kf,f, - Kj^a {K^i " ^'') ^Af. + 
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A. 3 Optimisation of the Bound 



Optimisation of tlie bound (witli respect to tlie variational parameters and the parameters 
of the covariance functions) can be carried out through gradient based methods. We follow 



the notation of [Brookes (2005) obtaining similar results to Lawrence (2007). This notation 



allows us to apply the chain rule for matrix derivation in a straight-forward manner. The 
resulting gradients can then be combined with gradients of the covariance functions with 
respect to their parameters to optimize the model. 

Let's define G: = vecG, where vec is the vectorization operator over the matrix G. For 
a function .7V(Z) the equivalence between ^'^q^^ and ^'^q.^^ is given through ^'^q.'^^ = 

^"^gc^O •) ■ '^^^ log-likelihood function is given as 



1 



1 



Tvi'L, @) oc -- loglE + KfAK^^KAf I - - tr 



1 



where K = Kff — Kf;vK_^j^K;^f . Using the matrix inversion lemma and its equivalent form 
for determinants, the above expression can be written as 

1 



^v(Z,0) ^hog\}^xx\ - ^log|A| -hog\T.\ - ^tr 



1 

-h - tr 
2 



>^fA 



-In 



-,-1 



yy 



1 



tr ( E^^K 



up to a constant. We can find — g^-^ and — ' applying the chain rule to Tvi'^-,®) 

obtaining expressions for ^qk^^ ' ^dVi^ ^"^^ ^dVi^ ^^"^ combining those with the relevant 
derivatives of the covariances wrt 0, Z and the parameters associated to the model kernels, 



dT 
dG: 



dJ'A OA: 

aATacT 



dG: ' 



(10) 



where the subindex in J^e stands for those terms of J- which depend on E, G is either Kff, 
K;^f or K;^;^ and Sgk is zero if G is equal to Kff and one in other case. For convenience 
we have used = ^i/(Z, 0). Next we present expressions for each partial derivative 



dA: 



(KA,fI]-^®KA,fS-i), 



i((E-HI]-):)^ + i 



dA: 

dJ'A ^ 
dA: 



OA: 
dKxr- 



(KA,f5]-i I) + (I ® KA,fE-i) Ta, 



2^^-> ' dKxr. 



dK 



K 



-1 

A.A 



A-^Kx,f'S 
1 

~ 2 



+ 



5J%f f _ 1 

5Kf,f: ~ ~2 
KxkKA,f5]^ 



Kx.AKA,f 



T 



where C = A^^ + A-^Kx f I]-iyy^5]-iKf xA-i, H = S -yy^ + Kf>^^_wS-Vy^ + 



(Kf^;^A ^K-x fYl ^yy^) and Ta is a vectorized transpose matrix (Brookes 



have not included its dimensions to keep the notation clearer. We can rep^ 



20051 and we 



ace the above 



19 



expressions in (10) to find the corresponding derivatives, so 



5Kf,f: 2 

We also have 

' (C:)^ [{Kx,f^-^ I) + (I K;,,fl]-i) Ta] + ( ( A'^K^.f Vy^^] 



5K;,,f: 2 

-CKx,f^-' + A-iK;,,fI]-Vy^S-i + K-J^K;,,fI]-ij : 
Finally, results for q^^. and are obtained as 
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